library(broom)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(ggridges)
library(hrbrthemes)
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(kableExtra)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(parsnip)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(randomForestExplainer)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(rsample)
library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble  3.1.0     ✓ dplyr   1.0.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x lubridate::as.difftime() masks base::as.difftime()
## x dplyr::combine()         masks randomForest::combine()
## x lubridate::date()        masks base::date()
## x dplyr::filter()          masks rstatix::filter(), plotly::filter(), stats::filter()
## x dplyr::group_rows()      masks kableExtra::group_rows()
## x lubridate::intersect()   masks base::intersect()
## x dplyr::lag()             masks stats::lag()
## x purrr::lift()            masks caret::lift()
## x randomForest::margin()   masks ggplot2::margin()
## x lubridate::setdiff()     masks base::setdiff()
## x lubridate::union()       masks base::union()

Literature Review

The promotion of public transportation, despite being a common vision among nations, still is encountering road safety issues with the growing bus accidents in some countries. In emerging countries, e.g. Nepal and India, drivers and driving behaviour had been the leading cause of accidents, followed by external factors and vehicle conditions (Pearce & Maunder, 2000). Although as a developed country, investigations on the common factors mentioned above and the need to discover more relevant ones is necessary for Australia.

Truong and Currie (2019) assess the effectiveness of riding public transportation as a means of overcoming road safety issues. Types of transportation are individually measured in terms of their relative impacts. The CAR model takes into account factors, e.g. the age of the population in the sampled area, the speed limits and more, to study the causation between the environment and crashes. The ultimate result proves buses to be the safest transportation. A percentage point of increase in the proportion of riding buses could reduce 5.7. Further, factors relevant to buses - signalised intersections, public transport stops and roads with speed limits above 100km/h are positively associated with crashes.

Samerei, Aghabayk, Mohammadi and Shiwakoti (2021) follow up to consider the related factors’ relationship with the fatality due to bus accidents in Australia. Clustering analysis and association rules are applied respectively so that simultaneous or conditional events of factors can be revealed. From the clustering analysis, the largest proportion of total crashes are occupied by collisions with motor vehicles on weekdays (55%). Weekends are of great danger, accumulating the most fatality and serious injury, compared with other clusters. From the analysis of association rules, on weekdays, roads with speed limits over 50 are more likely to have bus collisions with motor vehicles than those without. Old, male drivers are associated with a 1.98 increase in cases except colliding with motor vehicles on weekdays. On weekends, the existence of pedestrians on highways is of high risk with 15.35 times of fatality increase. Additionally, darkness is more likely to incur the fatality of bus accidents in Australia.

In contrast, from a preventive perspective, Porcu, Olivo, Maternini and Barabino (2019) establish a risk index methodology to examine bus safety in Cagliari, Italy. The probability, severity of bus accidents and exposing factors are incorporated into the risk assessment index. Unlike Truong and Currie (2019), relevant variables are not explicitly specified in the function, however are categorised under six broad factors: infrastructure, driver, vehicle, etc. Based on model selection, it draws to the conclusion that in Cagliari, medium, short sized buses traveling on roads with two to three lanes with bus priority are conditions that could result in a low-risk index. Different from the results in Australia (Samerei et al., 2021), night is of less danger in Cagliari. On the opposite, drivers of standard-sized buses driving on roads with wide sidewalks and neighbourhood roads should be more aware as these factors tend to raise the risk level. Winter is also found to be more dangerous.

Unlike countries e.g. the US and the UK, the operation of buses in Malaysia has not been well managed over years with the growing bus accidents in the intercity. Law, Daud, Hamid and Haron (2017) adopt a similar idea as Porcu et al. (2019) by creating Safety Performance Index (SPI) to evaluate the risk based on the attributable factors. Through the aggregation and classification of the factors, road environment condition accounts for the most detrimental cause of bus accidents. Two-lane roads, narrow road shoulders and nighttime trips are positively associated with the most risk. From the data exploration, factors e.g. using mobile phones when driving, drivers exceeding the speed limits and the low quality of tires, are identified to be hazardous.

Bus priority policy is focused by Goh, Currie, Sarvi and Logan (2014) to discover the weight it has placed on resolving bus accident issues in Melbourne. Prediction models are built with the potential factors as the inputs and the corresponding parameters. The better-performing model acknowledges bus priority as a useful approach in maintaining road safety with a reduction of 53.3% in accident frequency. Further, it reveals the tendency for accidents to occur more often as traffic volume, route length and service frequency increase. Again, it verifies the positive correlation between bus stop density and accident frequency, as found in Truong and Currie (2019).

In short, bus stop density and route length especially highways are the common factors addressed across Australian-based articles that more attention should be given to in response to bus safety. From the different regions discussed, bus priority plays a significant role in reducing environmental inference for bus drivers. The effect of darkness varies between countries, yet is contributing towards bus accidents in Australia.

Data wrangling

driver_info <- read_csv("data/Driver_Info.csv") %>%
  filter(!Depot == "Not Given")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   EmployeeID = col_character(),
##   Depot = col_character(),
##   CompanyID = col_character(),
##   Gender = col_character(),
##   EmployeeStartDate = col_character(),
##   EmploymentTerminationDate = col_character(),
##   Position = col_character(),
##   ContractType = col_character(),
##   FTE = col_character(),
##   BirthDate = col_character()
## )
incidents <- read_csv("data/Incidents.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   ClaimNumber = col_character(),
##   EmployeeNumber = col_character(),
##   ClaimCategoryId = col_double(),
##   ClaimCategoryName = col_character(),
##   TypeOfClaimId = col_double(),
##   TypeOfClaimName = col_character(),
##   TypeOfPartyAtFault = col_character(),
##   DateOccurred = col_datetime(format = ""),
##   company = col_character()
## )
nswshifts <- read_csv("data/NSWshifts.csv") %>%
  distinct()
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   VdsEmployeeId = col_character(),
##   VdsRouteId = col_character(),
##   VdsCompany = col_character(),
##   VdsTripStartTime = col_datetime(format = ""),
##   VdsTripEndTime = col_datetime(format = "")
## )
melbourneweather <- read_csv("data/melbourneweather_20150101_20210428.csv")
## Warning: Duplicated column names deduplicated: 'Melbourne' => 'Melbourne_1' [3],
## 'Melbourne' => 'Melbourne_2' [4], 'Melbourne' => 'Melbourne_3' [5], 'Melbourne'
## => 'Melbourne_4' [6], 'Melbourne' => 'Melbourne_5' [7]
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   location = col_character(),
##   Melbourne = col_character(),
##   Melbourne_1 = col_character(),
##   Melbourne_2 = col_character(),
##   Melbourne_3 = col_character(),
##   Melbourne_4 = col_character(),
##   Melbourne_5 = col_character()
## )
sydneyweather <- read_csv("data/sydneyweather_20150101_20210415.csv")
## Warning: Duplicated column names deduplicated: 'Sydney' => 'Sydney_1' [3],
## 'Sydney' => 'Sydney_2' [4], 'Sydney' => 'Sydney_3' [5], 'Sydney' =>
## 'Sydney_4' [6], 'Sydney' => 'Sydney_5' [7]
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   location = col_character(),
##   Sydney = col_character(),
##   Sydney_1 = col_character(),
##   Sydney_2 = col_character(),
##   Sydney_3 = col_character(),
##   Sydney_4 = col_character(),
##   Sydney_5 = col_character()
## )
incidents <- incidents %>%
  unite(EmployeeID, c("company", "EmployeeNumber"), remove = FALSE)
nswshifts$VdsCompany <- toupper(nswshifts$VdsCompany)
nswshifts <- nswshifts %>%
  unite(EmployeeID, c("VdsCompany", "VdsEmployeeId"), remove = FALSE)


tmp <- incidents %>%
  filter(company == 'TDNSW') %>%
  select(ClaimNumber, EmployeeID, ClaimCategoryName, TypeOfClaimName, DateOccurred) %>%
  mutate(JoinDate = as.Date(DateOccurred)) %>%
  left_join(nswshifts %>%
              mutate(JoinDate = as.Date(VdsTripStartTime)),
            by = c("EmployeeID","JoinDate")) %>%
  mutate(match_type = case_when(
    is.na(VdsTripStartTime) ~ 0,
    DateOccurred >= VdsTripStartTime & DateOccurred <= VdsTripEndTime ~ 2,
    TRUE ~ 1
  ))
matched <-
  tmp %>%
  inner_join(
    tmp %>% group_by(ClaimNumber) %>% summarise(match_type = max(match_type)),
    by = c("ClaimNumber", "match_type")
  ) %>%
  arrange(match_type,ClaimNumber,VdsTripStartTime)

tmp2 <- incidents %>%
  filter(company == 'TDNSW') %>%
  select(ClaimNumber, EmployeeID, ClaimCategoryName, TypeOfClaimName,DateOccurred) %>%
  mutate(JoinDate = as.Date(DateOccurred)) %>%
  left_join(nswshifts %>%
              mutate(JoinDate = as.Date(VdsTripStartTime)) %>%
              group_by(EmployeeID, JoinDate) %>%
              summarise(shift_start = min(VdsTripStartTime)-30*60,
                        shift_end = max(VdsTripEndTime)+30*60),
            by = c("EmployeeID","JoinDate")) %>%
  mutate(match_type = case_when(
    is.na(shift_start) ~ 0,
    DateOccurred >= shift_start & DateOccurred <= shift_end ~ 2,
    TRUE ~ 1
  ))
## `summarise()` has grouped output by 'EmployeeID'. You can override using the `.groups` argument.
matched <-
  tmp2 %>%
  inner_join(
    tmp2 %>% group_by(ClaimNumber) %>% summarise(match_type = max(match_type)),
    by = c("ClaimNumber", "match_type")
  ) %>%
  arrange(match_type,ClaimNumber,shift_start)

tmp <- tmp %>%
  mutate(day_of_week = weekdays(as.Date(DateOccurred)))

tmp2 <- tmp2 %>%
  mutate(day_of_week = weekdays(as.Date(DateOccurred)))
melbourneweather <- melbourneweather %>%
  slice(-1:-9) %>%
  select(location, Melbourne, Melbourne_1) %>%
  rename(c(timestamp = location, `Melbourne Temperature` = Melbourne, `Melbourne Precipitation Total` = Melbourne_1)) 
melbourneweather <- melbourneweather %>%
        mutate(timestamp = ymd_hm(timestamp))
sydneyweather <- sydneyweather %>%
  slice(-1:-9) %>%
  select(location, Sydney, Sydney_1) %>%
  rename(c(timestamp = location, `Sydney Temperature` = Sydney, `Sydney Precipitation Total` = Sydney_1)) 
sydneyweather <- sydneyweather %>%
        mutate(timestamp = ymd_hm(timestamp))

Questions:

1) How are drivers’ characteristics (Depot, gender and age) affecting the incidents? Do they genuinely contribute to the occurrence of incidents?

# all drivers in each depot
drvr_inc_dpt <- incidents %>%
  drop_na(EmployeeNumber) %>%
  distinct(EmployeeID, ClaimCategoryId, DateOccurred, .keep_all = TRUE) %>%
  right_join(driver_info, by = "EmployeeID") %>%
  mutate(driver_incident = case_when(is.na(ClaimNumber) ~ "No",
                                         !is.na(ClaimNumber) ~ "Yes")) %>%
  group_by(Depot, CompanyID, driver_incident, EmployeeID) %>%
  summarise(driver_depot = n_distinct(EmployeeID)) %>%
  filter(driver_incident == "Yes") %>%
  group_by(Depot, CompanyID) %>%
  summarise(num_drv_inc = sum(driver_depot))
## `summarise()` has grouped output by 'Depot', 'CompanyID', 'driver_incident'. You can override using the `.groups` argument.
## `summarise()` has grouped output by 'Depot'. You can override using the `.groups` argument.
total_drvr_dpt <- driver_info %>%
  group_by(Depot, CompanyID) %>%
  summarise(num_of_drvr = n())
## `summarise()` has grouped output by 'Depot'. You can override using the `.groups` argument.
drvr_inc_dpt %>%
  left_join(total_drvr_dpt, by = c("Depot", "CompanyID")) %>%
  mutate(has_incident_dpt = (num_drv_inc/num_of_drvr)*100) %>%
  mutate(across(has_incident_dpt, round, 2)) %>%
  ggplot(aes(x = Depot, y = has_incident_dpt, fill = CompanyID)) +
  geom_bar(stat = "identity") + 
  geom_text(aes(label = has_incident_dpt), vjust=1.6, color="white", size=2.5) +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(y = "Drivers had claims", fill = "Company", title = "Proportion of drivers had claims by Depot")

severity_inc <- incidents %>%
  mutate(severity = case_when(TypeOfClaimName %in% c("MVA - Hit by Third Party", "MVA - Hit parked vehicle", "Customer Injury- Heavy Braking", "MVA - Hit moving vehicle while driving straight", "MVA - Hit moving vehicle while turning", "MVA - Hit stationary vehicle while turning", "MVA - Hit stationary vehicle driving straight", "MVA - Hit illegally parked vehicle", "MVA - Hit moving vehicle while changing lanes", "Pedestrian Injury - Hit by Transdev vehicle", "MVA - Hit bus station", "Employee Injury - Motor Vehicle Accident", "Motor Vehicle Collision", "Collision with Public") ~ "serious",
                              TypeOfClaimName %in% c("MVA - Hit tree/ tree branch", "MVA - Hit barrier/ bollard", "MVA - Near Miss", "MVA - Reversing", "MVA bus stop sign", "MVA bus stop pole") ~ "slight")) %>%
  filter(severity %in% c("serious", "slight"))

inc_severity_dept <- severity_inc %>%
  drop_na(EmployeeNumber) %>%
  distinct(EmployeeID, ClaimCategoryId, DateOccurred, .keep_all = TRUE) %>%
  right_join(driver_info, by = "EmployeeID") %>%
  mutate(driver_incident = case_when(is.na(ClaimNumber) ~ "No",
                                         !is.na(ClaimNumber) ~ "Yes")) %>%
  group_by(Depot, CompanyID, driver_incident, EmployeeID, severity) %>%
  summarise(driver_depot = n_distinct(EmployeeID)) %>%
  filter(driver_incident == "Yes") %>%
  group_by(Depot, severity, CompanyID) %>%
  summarise(num_drv_inc = sum(driver_depot))
## `summarise()` has grouped output by 'Depot', 'CompanyID', 'driver_incident', 'EmployeeID'. You can override using the `.groups` argument.
## `summarise()` has grouped output by 'Depot', 'severity'. You can override using the `.groups` argument.
total_drvr_dpt <- driver_info %>%
  group_by(Depot, CompanyID) %>%
  summarise(num_of_drvr = n())
## `summarise()` has grouped output by 'Depot'. You can override using the `.groups` argument.
svr_dpt <- inc_severity_dept %>%
  left_join(total_drvr_dpt, by = c("Depot", "CompanyID")) %>%
  mutate(has_incident_dpt = (num_drv_inc/num_of_drvr)*100) %>%
  mutate(across(has_incident_dpt, round, 2)) %>%
  pivot_wider(names_from = severity, values_from = has_incident_dpt) %>%
  mutate(serious = replace_na(serious, 0),
         slight = replace_na(slight, 0))

svr_dpt %>%
  mutate(serious_label = paste0(CompanyID," - Serious"),
         slight_label = paste0(CompanyID," - Slight")) %>%
  plot_ly(x = ~Depot, color = ~CompanyID) %>%
  add_trace(y = ~serious, name = ~serious_label, type='bar',width = 0.3) %>%
  add_trace(y = ~slight, name = ~slight_label, type='bar', width=0.3,opacity=0.5) %>%
  layout(
    title = "Proportion of drivers had claims by Depot and Severity",
    yaxis = list(title = "Drivers had claims (%)",
                 range=c(0,100)),
    updatemenus = list(
      list(
        type = "buttons",
        #label = 'Category',
        buttons = list(
          list(method = "restyle",
               args = list('visible', c(rep(TRUE,3), rep(FALSE,3))),
               label = "Serious Only"),
          list(method = "restyle",
               args = list('visible', c(rep(FALSE,3), rep(TRUE,3))),
               label = "Slight Only"),
          list(method = "restyle",
               args = list('visible', TRUE),
               label = "Show All")
        )
      )
    )
  )

Gender

drvr_inc_gdr <- incidents %>%
  drop_na(EmployeeNumber) %>%
  distinct(EmployeeID, ClaimCategoryId, DateOccurred, .keep_all = TRUE) %>%
  right_join(driver_info, by = "EmployeeID") %>%
  mutate(driver_incident = case_when(is.na(ClaimNumber) ~ "No",
                                         !is.na(ClaimNumber) ~ "Yes")) %>%
  group_by(Gender, CompanyID, driver_incident, EmployeeID) %>%
  summarise(driver_gdr = n_distinct(EmployeeID)) %>%
  filter(driver_incident == "Yes") %>%
  group_by(Gender, CompanyID) %>%
  summarise(num_drv_inc = sum(driver_gdr))
## `summarise()` has grouped output by 'Gender', 'CompanyID', 'driver_incident'. You can override using the `.groups` argument.
## `summarise()` has grouped output by 'Gender'. You can override using the `.groups` argument.
total_drvr_gdr <- driver_info %>%
  group_by(Gender, CompanyID) %>%
  summarise(num_of_drvr = n())
## `summarise()` has grouped output by 'Gender'. You can override using the `.groups` argument.
drvr_inc_gdr %>%
  full_join(total_drvr_gdr, by = c("Gender", "CompanyID")) %>%
  mutate(num_drv_inc = replace_na(num_drv_inc, 0)) %>%
  mutate(has_incident_gdr = (num_drv_inc/num_of_drvr)*100) %>%
  mutate(across(has_incident_gdr, round, 2)) %>%
  ggplot(aes(x = Gender, y = has_incident_gdr, fill = Gender)) +
  geom_bar(stat = "identity") + 
  geom_text(aes(label = has_incident_gdr), vjust=1.6, color="white", size=2.5) +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  theme_minimal() +
  facet_grid(.~CompanyID) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(y = "Drivers had claims", fill = "Gender", title = "Proportion of drivers had claims by Gender and company")

inc_severity_gdr <- severity_inc %>%
  drop_na(EmployeeNumber) %>%
  distinct(EmployeeID, ClaimCategoryId, DateOccurred, .keep_all = TRUE) %>%
  right_join(driver_info, by = "EmployeeID") %>%
  mutate(driver_incident = case_when(is.na(ClaimNumber) ~ "No",
                                         !is.na(ClaimNumber) ~ "Yes")) %>%
  group_by(Gender, CompanyID, driver_incident, EmployeeID, severity) %>%
  summarise(driver_depot = n_distinct(EmployeeID)) %>%
  filter(driver_incident == "Yes") %>%
  group_by(Gender, severity, CompanyID) %>%
  summarise(num_drv_inc = sum(driver_depot))
## `summarise()` has grouped output by 'Gender', 'CompanyID', 'driver_incident', 'EmployeeID'. You can override using the `.groups` argument.
## `summarise()` has grouped output by 'Gender', 'severity'. You can override using the `.groups` argument.
total_drvr_gdr <- driver_info %>%
  group_by(Gender, CompanyID) %>%
  summarise(num_of_drvr = n())
## `summarise()` has grouped output by 'Gender'. You can override using the `.groups` argument.
svr_gdr <- inc_severity_gdr %>%
  left_join(total_drvr_gdr, by = c("Gender", "CompanyID")) %>%
  mutate(has_incident_gdr = (num_drv_inc/num_of_drvr)*100) %>%
  mutate(across(has_incident_gdr, round, 2)) %>%
  pivot_wider(names_from = severity, values_from = has_incident_gdr) %>%
  mutate(serious = replace_na(serious, 0),
         slight = replace_na(slight, 0))

svr_gdr %>%
  mutate(serious_label = paste0(CompanyID," - Serious"),
         slight_label = paste0(CompanyID," - Slight")) %>%
  plot_ly(x = ~Gender, color = ~CompanyID) %>%
  add_trace(y = ~serious, name = ~serious_label, type='bar') %>%
  add_trace(y = ~slight, name = ~slight_label, type='bar',opacity=0.5)  %>%
  layout(
    yaxis = list(title = "Drivers had claims (%)",
      range=c(0,100)
    ),
    title = "Proportion of drivers had claims by Gender and Severity",
    updatemenus = list(
      list(
        type = "buttons",
        label = 'Category',
        buttons = list(
          list(method = "restyle",
               args = list('visible', c(rep(TRUE,3), rep(FALSE,3))),
               label = "Serious Only"),
          list(method = "restyle",
               args = list('visible', c(rep(FALSE,3), rep(TRUE,3))),
               label = "Slight Only"),
          list(method = "restyle",
               args = list('visible', TRUE),
               label = "Show All")
        )
      )
    )
  )

MVA

gen_inc <- incidents %>%
  filter(grepl('MVA|Motor', ClaimCategoryName)) %>%
  drop_na(EmployeeNumber) %>%
  distinct(EmployeeID, ClaimCategoryId, DateOccurred, .keep_all = TRUE) %>%
  right_join(driver_info, by = "EmployeeID") %>%
  mutate(driver_incident = case_when(is.na(ClaimNumber) ~ "No",
                                         !is.na(ClaimNumber) ~ "Yes")) %>%
  group_by(Gender, driver_incident, EmployeeID) %>%
  summarise(driver_depot = n_distinct(EmployeeID)) %>%
  group_by(Gender, driver_incident) %>%
  summarise(num_drv_inc = sum(driver_depot))
## `summarise()` has grouped output by 'Gender', 'driver_incident'. You can override using the `.groups` argument.
## `summarise()` has grouped output by 'Gender'. You can override using the `.groups` argument.
tot_drv_gen <- driver_info %>%
  distinct(EmployeeID, Gender) %>%
  group_by(Gender) %>%
  summarise(num_of_drvr = n())

gen_inc_tab <- gen_inc %>%
  pivot_wider(names_from = driver_incident, values_from = num_drv_inc) %>%
  left_join(tot_drv_gen, by = "Gender") %>%
  mutate(has_incident_gdr = (Yes/num_of_drvr)*100) %>%
  mutate(across(has_incident_gdr, round, 2)) %>%
  rename(incident = Yes, `no incident` = No, total = num_of_drvr, prop = has_incident_gdr) 
gen_inc_tab %>%
  kable(caption = "Contingency table for gender and incident") %>%
  kable_styling()
Contingency table for gender and incident
Gender no incident incident total prop
Female 225 151 376 40.16
Male 1566 2298 3864 59.47
test_x <- as.matrix(gen_inc_tab %>%
                      ungroup() %>%
                      select(incident, `no incident`))
CLTest_gen <- prop.test(test_x)
tidy(CLTest_gen) %>%
  kable() %>%
  kable_styling(bootstrap_options = "striped")
estimate1 estimate2 statistic p.value parameter conf.low conf.high method alternative
0.4015957 0.5947205 51.59346 0 1 -0.2464959 -0.1397536 2-sample test for equality of proportions with continuity correction two.sided

0.02 < 0.05 Reject the null hypothesis. The data provides strong evidence that gender does have an effect on incidents.

The small ( <5%) p-value suggests that the null hypothesis of equal proportions can be rejected. Difference in gender doesn’t have an effect on incidents.

Age

inc_age <- incidents %>%
  drop_na(EmployeeNumber) %>%
  distinct(EmployeeID, ClaimCategoryId, DateOccurred, .keep_all = TRUE) %>%
  right_join(driver_info, by = "EmployeeID") %>%
  mutate(driver_age = today() - dmy(BirthDate),
         driver_age = driver_age/dyears(),
         driver_incident = case_when(is.na(ClaimNumber) ~ "No",
                                         !is.na(ClaimNumber) ~ "Yes"))
inc_age$driver_age <- round(inc_age$driver_age, digits = 0)
inc_age <- inc_age %>%
  mutate(age_group = case_when(driver_age >= 20 & driver_age <= 30 ~ "20-30",
                               driver_age >= 31 & driver_age <= 40 ~ "31-40",
                               driver_age >= 41 & driver_age <= 50 ~ "41-50",
                               driver_age >= 51 & driver_age <= 60 ~ "51-60",
                               driver_age >= 61 & driver_age <= 70 ~ "61-70",
                               driver_age >= 71 & driver_age <= 80 ~ "71-80",
                               driver_age >= 81 & driver_age <= 90 ~ "81-90"))
inc_age <- inc_age %>%
  group_by(age_group, driver_incident, EmployeeID) %>%
  summarise(driver_age_group = n_distinct(EmployeeID)) %>%
  filter(driver_incident == "Yes") %>%
  group_by(age_group) %>%
  summarise(num_drv_inc = sum(driver_age_group))
## `summarise()` has grouped output by 'age_group', 'driver_incident'. You can override using the `.groups` argument.
drvr_info_age <- driver_info %>%
  mutate(driver_age = today() - dmy(BirthDate),
         driver_age = driver_age/dyears())
drvr_info_age$driver_age <- round(drvr_info_age$driver_age, digits = 0)
drvr_info_age <- drvr_info_age %>%
  mutate(age_group = case_when(driver_age >= 20 & driver_age <= 30 ~ "20-30",
                               driver_age >= 31 & driver_age <= 40 ~ "31-40",
                               driver_age >= 41 & driver_age <= 50 ~ "41-50",
                               driver_age >= 51 & driver_age <= 60 ~ "51-60",
                               driver_age >= 61 & driver_age <= 70 ~ "61-70",
                               driver_age >= 71 & driver_age <= 80 ~ "71-80",
                               driver_age >= 81 & driver_age <= 90 ~ "81-90"))
drvr_info_age <- drvr_info_age %>%
  group_by(age_group) %>%
  summarise(num_of_drvr = n())

inc_age %>%
  full_join(drvr_info_age, by = "age_group") %>%
  mutate(has_incident_age = (num_drv_inc/num_of_drvr)*100) %>%
  mutate(across(has_incident_age, round, 2)) %>%
  ggplot(aes(x = age_group, y = has_incident_age, fill = age_group)) +
  geom_bar(stat = "identity") + 
  geom_text(aes(label = has_incident_age), vjust=1.6, color="white", size=2.5) +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(x = "Age group", y = "Drivers had claims", fill = "Age group", title = "Proportion of drivers had claims by Age")

inc_severity_age <- severity_inc %>%
  drop_na(EmployeeNumber) %>%
  distinct(EmployeeID, ClaimCategoryId, DateOccurred, .keep_all = TRUE) %>%
  right_join(driver_info, by = "EmployeeID") %>%
  mutate(driver_age = today() - dmy(BirthDate),
         driver_age = driver_age/dyears(),
         driver_incident = case_when(is.na(ClaimNumber) ~ "No",
                                         !is.na(ClaimNumber) ~ "Yes")) 

inc_severity_age$driver_age <- round(inc_severity_age$driver_age, digits = 0)
inc_severity_age <- inc_severity_age %>%
  mutate(age_group = case_when(driver_age >= 20 & driver_age <= 30 ~ "20-30",
                               driver_age >= 31 & driver_age <= 40 ~ "31-40",
                               driver_age >= 41 & driver_age <= 50 ~ "41-50",
                               driver_age >= 51 & driver_age <= 60 ~ "51-60",
                               driver_age >= 61 & driver_age <= 70 ~ "61-70",
                               driver_age >= 71 & driver_age <= 80 ~ "71-80",
                               driver_age >= 81 & driver_age <= 90 ~ "81-90"))

inc_severity_age <- inc_severity_age %>%
  group_by(age_group, driver_incident, EmployeeID, severity) %>%
  summarise(driver_age_group = n_distinct(EmployeeID)) %>%
  filter(driver_incident == "Yes") %>%
  group_by(age_group, severity) %>%
  summarise(num_drv_inc = sum(driver_age_group))
## `summarise()` has grouped output by 'age_group', 'driver_incident', 'EmployeeID'. You can override using the `.groups` argument.
## `summarise()` has grouped output by 'age_group'. You can override using the `.groups` argument.
drvr_info_age <- driver_info %>%
  mutate(driver_age = today() - dmy(BirthDate),
         driver_age = driver_age/dyears())
drvr_info_age$driver_age <- round(drvr_info_age$driver_age, digits = 0)
drvr_info_age <- drvr_info_age %>%
  mutate(age_group = case_when(driver_age >= 20 & driver_age <= 30 ~ "20-30",
                               driver_age >= 31 & driver_age <= 40 ~ "31-40",
                               driver_age >= 41 & driver_age <= 50 ~ "41-50",
                               driver_age >= 51 & driver_age <= 60 ~ "51-60",
                               driver_age >= 61 & driver_age <= 70 ~ "61-70",
                               driver_age >= 71 & driver_age <= 80 ~ "71-80",
                               driver_age >= 81 & driver_age <= 90 ~ "81-90"))
drvr_info_age <- drvr_info_age %>%
  group_by(age_group) %>%
  summarise(num_of_drvr = n())

svr_age <- inc_severity_age %>%
  full_join(drvr_info_age, by = "age_group") %>%
  mutate(has_incident_age = (num_drv_inc/num_of_drvr)*100) %>%
  mutate(across(has_incident_age, round, 2)) %>%
  pivot_wider(names_from = severity, values_from = has_incident_age) %>%
  mutate(serious = replace_na(serious, 0),
         slight = replace_na(slight, 0))

plot_ly(svr_age, x = ~age_group, mode='markers') %>%
  add_trace(y = ~serious, name = 'serious', type='bar', mode='markers') %>%
  add_trace(y = ~slight, name = 'slight', type='bar', mode='markers') %>%
  layout(
    title = "Proportion of drivers had claims by Age group and Severity",
    xaxis = list(title = "Age group"),
    yaxis = list(title = "Drivers had claims (%)",
                 range=c(0,100)),
    updatemenus = list(
      list(
        type = "buttons",
        label = 'Category',
        buttons = list(
          list(method = "restyle",
               args = list('visible', c(TRUE, FALSE, FALSE)),
               label = "Serious Only"),
          list(method = "restyle",
               args = list('visible', c(FALSE, TRUE, FALSE)),
               label = "Slight Only"),
          list(method = "restyle",
               args = list("visible", TRUE),
               label = "Show All")
        )
      )
    )
  )
## Warning: 'bar' objects don't have these attributes: 'mode'
## Valid attributes include:
## 'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'meta', 'selectedpoints', 'hoverinfo', 'hoverlabel', 'stream', 'transforms', 'uirevision', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'xperiod', 'yperiod', 'xperiod0', 'yperiod0', 'xperiodalignment', 'yperiodalignment', 'text', 'texttemplate', 'hovertext', 'hovertemplate', 'textposition', 'insidetextanchor', 'textangle', 'textfont', 'insidetextfont', 'outsidetextfont', 'constraintext', 'cliponaxis', 'orientation', 'base', 'offset', 'width', 'marker', 'offsetgroup', 'alignmentgroup', 'selected', 'unselected', 'r', 't', '_deprecated', 'error_x', 'error_y', 'xcalendar', 'ycalendar', 'xaxis', 'yaxis', 'idssrc', 'customdatasrc', 'metasrc', 'hoverinfosrc', 'xsrc', 'ysrc', 'textsrc', 'texttemplatesrc', 'hovertextsrc', 'hovertemplatesrc', 'textpositionsrc', 'basesrc', 'offsetsrc', 'widthsrc', 'rsrc', 'tsrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'

## Warning: 'bar' objects don't have these attributes: 'mode'
## Valid attributes include:
## 'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'meta', 'selectedpoints', 'hoverinfo', 'hoverlabel', 'stream', 'transforms', 'uirevision', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'xperiod', 'yperiod', 'xperiod0', 'yperiod0', 'xperiodalignment', 'yperiodalignment', 'text', 'texttemplate', 'hovertext', 'hovertemplate', 'textposition', 'insidetextanchor', 'textangle', 'textfont', 'insidetextfont', 'outsidetextfont', 'constraintext', 'cliponaxis', 'orientation', 'base', 'offset', 'width', 'marker', 'offsetgroup', 'alignmentgroup', 'selected', 'unselected', 'r', 't', '_deprecated', 'error_x', 'error_y', 'xcalendar', 'ycalendar', 'xaxis', 'yaxis', 'idssrc', 'customdatasrc', 'metasrc', 'hoverinfosrc', 'xsrc', 'ysrc', 'textsrc', 'texttemplatesrc', 'hovertextsrc', 'hovertemplatesrc', 'textpositionsrc', 'basesrc', 'offsetsrc', 'widthsrc', 'rsrc', 'tsrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'

MVA

age_mva_inc <- incidents %>%
  filter(grepl('MVA|Motor', ClaimCategoryName)) %>%
  drop_na(EmployeeNumber) %>%
  distinct(EmployeeID, ClaimCategoryId, DateOccurred, .keep_all = TRUE) %>%
  right_join(driver_info, by = "EmployeeID") %>%
  mutate(driver_age = today() - dmy(BirthDate),
         driver_age = driver_age/dyears(),
         driver_incident = case_when(is.na(ClaimNumber) ~ "No",
                                         !is.na(ClaimNumber) ~ "Yes"))
age_mva_inc$driver_age <- round(age_mva_inc$driver_age, digits = 0)
age_mva_inc <- age_mva_inc %>%
  mutate(age_group = case_when(driver_age >= 20 & driver_age <= 30 ~ "20-30",
                               driver_age >= 31 & driver_age <= 40 ~ "31-40",
                               driver_age >= 41 & driver_age <= 50 ~ "41-50",
                               driver_age >= 51 & driver_age <= 60 ~ "51-60",
                               driver_age >= 61 & driver_age <= 70 ~ "61-70",
                               driver_age >= 71 & driver_age <= 80 ~ "71-80",
                               driver_age >= 81 & driver_age <= 90 ~ "81-90")) %>%
  group_by(age_group, driver_incident, EmployeeID) %>%
  summarise(driver_age_group = n_distinct(EmployeeID)) %>%
  group_by(age_group, driver_incident) %>%
  summarise(num_drvage_inc = sum(driver_age_group))
## `summarise()` has grouped output by 'age_group', 'driver_incident'. You can override using the `.groups` argument.
## `summarise()` has grouped output by 'age_group'. You can override using the `.groups` argument.
tot_drv_age <- driver_info %>%
  mutate(driver_age = today() - dmy(BirthDate),
         driver_age = driver_age/dyears())

tot_drv_age$driver_age <- round(tot_drv_age$driver_age, digits = 0)

tot_drv_age <- tot_drv_age %>%
  mutate(age_group = case_when(driver_age >= 20 & driver_age <= 30 ~ "20-30",
                               driver_age >= 31 & driver_age <= 40 ~ "31-40",
                               driver_age >= 41 & driver_age <= 50 ~ "41-50",
                               driver_age >= 51 & driver_age <= 60 ~ "51-60",
                               driver_age >= 61 & driver_age <= 70 ~ "61-70",
                               driver_age >= 71 & driver_age <= 80 ~ "71-80",
                               driver_age >= 81 & driver_age <= 90 ~ "81-90")) %>%
  distinct(EmployeeID, age_group) %>%
  group_by(age_group) %>%
  summarise(num_of_drvr = n())

age_inc_tab <- age_mva_inc %>%
  pivot_wider(names_from = driver_incident, values_from = num_drvage_inc) %>%
  left_join(tot_drv_age, by = "age_group") %>%
  mutate(has_incident_age = (Yes/num_of_drvr)*100) %>%
  mutate(across(has_incident_age, round, 2)) %>%
  rename(`age group` = age_group, incident = Yes, `no incident` = No, total = num_of_drvr, prop = has_incident_age) 

# ordinal e.g. 20-30, 41-50, 51-60, 61-70. Same contingency format
age_inc_tab %>%
  kable() %>%
  kable_styling()
age group no incident incident total prop
20-30 67 36 103 34.95
31-40 329 527 856 61.57
41-50 387 525 912 57.57
51-60 460 674 1134 59.44
61-70 394 566 960 58.96
71-80 139 119 258 46.12
81-90 15 2 17 11.76

Driver’s age doesn’t explain much of the occurrence of incidents.

bin_age <- incidents %>%
  filter(grepl('MVA|Motor', ClaimCategoryName)) %>%
  drop_na(EmployeeNumber) %>%
  distinct(EmployeeID, ClaimCategoryId, DateOccurred, .keep_all = TRUE) %>%
  right_join(driver_info, by = "EmployeeID") %>%
  mutate(driver_age = today() - dmy(BirthDate),
         driver_age = driver_age/dyears(),
         driver_incident = case_when(is.na(ClaimNumber) ~ "No",
                                         !is.na(ClaimNumber) ~ "Yes"),
         driver_age = round(driver_age,digits=0),
         age_group = case_when(driver_age >= 20 & driver_age <= 30 ~ "20-30",
                               driver_age >= 31 & driver_age <= 40 ~ "31-40",
                               driver_age >= 41 & driver_age <= 50 ~ "41-50",
                               driver_age >= 51 & driver_age <= 60 ~ "51-60",
                               driver_age >= 61 & driver_age <= 70 ~ "61-70",
                               driver_age >= 71 & driver_age <= 80 ~ "71-80",
                               driver_age >= 81 & driver_age <= 90 ~ "81-90")) %>%
  mutate(binary_outcome = case_when(
    driver_incident == 'Yes' ~ 1,
    TRUE ~ 0
  ))

model_fit <- glm(binary_outcome ~ age_group, data=bin_age, family = "binomial") #double check this is logistic regression, not linear regression
summary(model_fit)
## 
## Call:
## glm(formula = binary_outcome ~ age_group, family = "binomial", 
##     data = bin_age)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0212   0.5271   0.5392   0.5759   1.6651  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      0.3904     0.1582   2.468  0.01359 *  
## age_group31-40   1.3223     0.1692   7.817 5.41e-15 ***
## age_group41-50   1.1627     0.1678   6.929 4.24e-12 ***
## age_group51-60   1.4645     0.1660   8.825  < 2e-16 ***
## age_group61-70   1.5133     0.1672   9.053  < 2e-16 ***
## age_group71-80   1.0603     0.1841   5.758 8.51e-09 ***
## age_group81-90  -1.4890     0.5401  -2.757  0.00583 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10023.7  on 11724  degrees of freedom
## Residual deviance:  9890.9  on 11718  degrees of freedom
## AIC: 9904.9
## 
## Number of Fisher Scoring iterations: 4
# find anova function that gives p value breakdown - something like 
age_mva_inc1 <- incidents %>%
  filter(grepl('MVA|Motor', ClaimCategoryName)) %>%
  drop_na(EmployeeNumber) %>%
  distinct(EmployeeID, ClaimCategoryId, DateOccurred, .keep_all = TRUE) %>%
  right_join(driver_info, by = "EmployeeID") %>%
  mutate(driver_age = today() - dmy(BirthDate),
         driver_age = driver_age/dyears(),
         driver_incident = case_when(is.na(ClaimNumber) ~ "No",
                                         !is.na(ClaimNumber) ~ "Yes"))
age_mva_inc1$driver_age <- round(age_mva_inc1$driver_age, digits = 0)
yo_70_inc <- age_mva_inc1 %>%
  mutate(age_group = case_when(driver_age < 70 ~ "Younger than 70",
                               driver_age >= 70 ~ "Older than 70")) %>%
  group_by(age_group, driver_incident, EmployeeID) %>%
  summarise(driver_age_group = n_distinct(EmployeeID)) %>%
  group_by(age_group, driver_incident) %>%
  summarise(num_drvage_inc = sum(driver_age_group))
## `summarise()` has grouped output by 'age_group', 'driver_incident'. You can override using the `.groups` argument.
## `summarise()` has grouped output by 'age_group'. You can override using the `.groups` argument.
tot_drv_age1 <- driver_info %>%
  mutate(driver_age = today() - dmy(BirthDate),
         driver_age = driver_age/dyears())

tot_drv_age1$driver_age <- round(tot_drv_age1$driver_age, digits = 0)

tot_drv_age1 <- tot_drv_age1 %>%
  mutate(age_group = case_when(driver_age < 70 ~ "Younger than 70",
                               driver_age >= 70 ~ "Older than 70")) %>%
  distinct(EmployeeID, age_group) %>%
  group_by(age_group) %>%
  summarise(num_of_drvr = n())

yo70_inc_tab <- yo_70_inc %>%
  pivot_wider(names_from = driver_incident, values_from = num_drvage_inc) %>%
  left_join(tot_drv_age1, by = "age_group") %>%
  mutate(has_incident_age = (Yes/num_of_drvr)*100) %>%
  mutate(across(has_incident_age, round, 2)) %>%
  rename(`age group` = age_group, incident = Yes, `no incident` = No, total = num_of_drvr, prop = has_incident_age) 


yo70_inc_tab %>%
  kable() %>%
  kable_styling()
age group no incident incident total prop
Older than 70 179 159 338 47.04
Younger than 70 1612 2290 3902 58.69
test_x2 <- as.matrix(yo70_inc_tab %>%
                      ungroup() %>%
                      select(incident, `no incident`))
CLTest_yo70 <- prop.test(test_x2)
tidy(CLTest_yo70)
## # A tibble: 1 x 9
##   estimate1 estimate2 statistic  p.value parameter conf.low conf.high method    
##       <dbl>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl> <chr>     
## 1     0.470     0.587      16.8  4.11e-5         1   -0.173   -0.0594 2-sample …
## # … with 1 more variable: alternative <chr>

May be less data for older than 70. Can’t conclude that drivers older than 70 have less incidents.

2) On what day have most incidents been found? Which hours are the peak and trough?

incidents %>%
  drop_na(EmployeeNumber) %>%
  distinct(EmployeeID, ClaimCategoryId, DateOccurred, .keep_all = TRUE) %>%
  mutate(day_week = weekdays(as.Date(DateOccurred))) %>%
  group_by(day_week) %>%
  summarise(total_cases = n()) %>%
  ggplot(aes(x = factor(day_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")), total_cases)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_text(aes(label=total_cases), vjust=1.6, color="white", size=3.5)+
    theme_minimal() +
  labs(y = "Total claim number", x = "Day", title = "Total claim by day of the week")

incidents %>%
  drop_na(EmployeeNumber) %>%
  distinct(EmployeeID, ClaimCategoryId, DateOccurred, .keep_all = TRUE) %>%
  mutate(day_week = weekdays(DateOccurred),
         hour = hour(DateOccurred)) %>%
  group_by(day_week, hour) %>%
  summarise(total_cases = n()) %>%
  ggplot(aes(x = hour, y = total_cases)) +
  geom_line() +
  facet_wrap(~factor(day_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))) +
    theme_minimal() +
  labs(y = "Total claim number", x = "Hour", title = "Total claim by daily hours")
## `summarise()` has grouped output by 'day_week'. You can override using the `.groups` argument.

3) Is it more dangerous for the bus to travel at night or during the daytime?

daynight_incidents <- incidents %>%
  drop_na(EmployeeNumber) %>%
  distinct(EmployeeID, ClaimCategoryId, DateOccurred, .keep_all = TRUE) %>%
  mutate(hour = hour(DateOccurred))

daynight_incidents %>%
  mutate(day_night = case_when(hour >= 19 | hour %in% c(0, 1, 2, 3, 4) ~ "night",
                               hour < 19 ~ "day")) %>%
  group_by(day_night) %>%
  summarise(total_cases = n()) %>%
  ggplot(aes(x = day_night, y = total_cases, fill = day_night)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label=total_cases), vjust=1.6, color="white", size=3.5) +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(x = "Daytime/night", y = "Total claim number", title = "Total claims made during daytime and night")

4) What kind of weather (temperature and precipitation) is considered safer or dangerous for bus drivers?

dh_mva_incidents <- incidents %>%
  drop_na(EmployeeNumber) %>%
  distinct(EmployeeID, ClaimCategoryId, DateOccurred, .keep_all = TRUE) %>%
  mutate(date_hour = format(DateOccurred, format = "%Y-%m-%d %H")) %>%
  filter(company %in% c("TDNSW", "TDM"),
         grepl('MVA|Motor', ClaimCategoryName))

melbourneweather <- melbourneweather %>%
  mutate(company = "TDM",
         date_hour = format(timestamp, format = "%Y-%m-%d %H")) %>%
  rename(Temperature = `Melbourne Temperature`, Precipitation = `Melbourne Precipitation Total`)

sydneyweather <- sydneyweather %>%
  mutate(company = "TDNSW",
         date_hour = format(timestamp, format = "%Y-%m-%d %H")) %>%
  rename(Temperature = `Sydney Temperature`, Precipitation = `Sydney Precipitation Total`)

mel_syd_weather <- rbind(melbourneweather, sydneyweather)

weather_n_incidents <- mel_syd_weather %>%
  mutate(hour = hour(timestamp)) %>%
  left_join(dh_mva_incidents, by = c("company", "date_hour")) %>%
  mutate(Temperature = as.numeric(Temperature),
         Precipitation = as.numeric(Precipitation),
         rain = case_when(Precipitation >= 2.5 ~ "Yes",
                          Precipitation < 2.5 ~ "No"),
         has_incident = case_when(is.na(ClaimNumber) ~ "No",
                                         !is.na(ClaimNumber) ~ "Yes"))

weather_n_incidents %>%
  filter(rain == "Yes") %>%
  ggplot(aes(x = Precipitation, group = has_incident, fill = has_incident)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~company, nrow = 2) +
  theme_minimal() +
  theme(legend.title = element_blank()) +
  scale_fill_discrete(labels=c("Non-incident", "Incident")) +
  labs(x = "Rainfall (mm)", title = "Incident and non-incident distribution under rainy condition measured by precipitation") +
  expand_limits(x = -0.5)

weather_n_incidents %>%
  filter(rain == "No") %>%
  ggplot(aes(x = Precipitation, group = has_incident, fill = has_incident)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~company, nrow = 2) +
  theme_minimal() +
  theme(legend.title = element_blank()) +
  scale_fill_discrete(labels=c("Non-incident", "Incident")) +
  labs(x = "Rainfall (mm)", title = "Incident and non-incident distribution under little/no rain condition measured by precipitation") +
  expand_limits(x = -0.2:2.4)

Rain distribution can do t-test, test significance. NOTE: look up requirements for t-test. two-sampled t-test

# compare precipitation between has_incident=yes and has_incident=no
rain_inc <- weather_n_incidents %>%
  select(Precipitation, has_incident) 
# has_incident = yes
summary(rain_inc %>%
          filter(has_incident == "Yes") %>%
          .$Precipitation)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05546 0.00000 9.40000
summary(rain_inc %>%
          filter(has_incident == "No") %>%
          .$Precipitation)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0484  0.0000 11.0000
ggplot(rain_inc, aes(Precipitation)) +
        geom_histogram(fill = "white", color = "grey30") +
        facet_wrap(~has_incident) +
        scale_x_log10()
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 103552 rows containing non-finite values (stat_bin).

wilcox.test(Precipitation ~ has_incident, data = rain_inc)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Precipitation by has_incident
## W = 522922546, p-value = 0.0008991
## alternative hypothesis: true location shift is not equal to 0
# is associated, relationship
weather_n_incidents %>%
  ggplot(aes(x = Temperature, group = has_incident, fill = has_incident)) +
  geom_density(alpha = 0.5) +
  facet_grid(~company) +
  theme_minimal() +
  theme(legend.title = element_blank()) +
  scale_fill_discrete(labels=c("Non-incident", "Incident")) +
  labs(x = "Temperature (°C)", title = "Incident and non-incident distribution measured by temperature")

Temperature:

temp_inc <- weather_n_incidents %>%
  select(Temperature, has_incident) 
# has_incident = yes
summary(rain_inc %>%
          filter(has_incident == "Yes") %>%
          .$Temperature)
## Warning: Unknown or uninitialised column: `Temperature`.
## Length  Class   Mode 
##      0   NULL   NULL
summary(rain_inc %>%
          filter(has_incident == "No") %>%
          .$Temperature)
## Warning: Unknown or uninitialised column: `Temperature`.
## Length  Class   Mode 
##      0   NULL   NULL
ggplot(temp_inc, aes(Temperature)) +
        geom_histogram(fill = "white", color = "grey30") +
        facet_wrap(~has_incident) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

t.test(Temperature ~ has_incident, data = temp_inc)
## 
##  Welch Two Sample t-test
## 
## data:  Temperature by has_incident
## t = -28.735, df = 12354, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.200039 -1.919052
## sample estimates:
##  mean in group No mean in group Yes 
##          16.57065          18.63019
# the existence of significance difference

5) Is there a relationship between drivers’ driving hours within a day and the likelihood of getting into an incident?

tmp2_when_occur <- tmp2 %>%
  filter(match_type == 2) %>%
  distinct(EmployeeID, ClaimCategoryName, DateOccurred, .keep_all = TRUE) %>%
  mutate(diff_occur_start = difftime(DateOccurred, shift_start),
         diff_occur_start = diff_occur_start/dhours(),
         hour = hour(DateOccurred),
         day_night = case_when(hour >= 19 | hour %in% c(0, 1, 2, 3, 4) ~ "Night",
                               hour < 19 ~ "Day")) %>%
  filter(grepl('MVA|Motor', ClaimCategoryName))
tmp2_when_occur$diff_occur_start <- round(tmp2_when_occur$diff_occur_start, digits = 0) 

ggplot(tmp2_when_occur, aes(x = diff_occur_start)) +
  geom_histogram(aes(y = ..density..), bins = 25, alpha = 0.4) +
  geom_density() +
  facet_grid(~day_night) +
  theme_minimal() +
  labs(x = "Number of hour(s) after shift start", title = "Hour(s) that claims made after drivers' driving hours in NSW")

FIT a model to predict incident, understand the relationship between variables. Random forest (suggested), can discover non-linear relationship, build tree diagrams, get variable importance. e.g. given an hour, is it going to be incident or not. variables e.g. gender, hour of driving, weather, time of day.

# fix the original random forest dataframe
nswshifts1 <- nswshifts %>%
  rename(shift_start = VdsTripStartTime, shift_end = VdsTripEndTime) %>%
  mutate(JoinDate = as.Date(shift_start)) %>%
  select(EmployeeID, shift_start, shift_end, JoinDate)

incident1 <- incidents %>%
  drop_na(EmployeeNumber) %>%
  distinct(EmployeeID, ClaimCategoryId, DateOccurred, .keep_all = TRUE) %>%
  filter(company == "TDNSW", grepl('MVA|Motor', ClaimCategoryName)) %>%
  mutate(JoinDate = as.Date(DateOccurred)) %>%
  group_by(EmployeeID, JoinDate) %>%
  select(EmployeeID, DateOccurred, JoinDate)

nswshifts_lj_inc <- nswshifts1 %>%
  group_by(EmployeeID, JoinDate) %>%
  summarise(shift_start = min(shift_start)-30*60,
            shift_end = max(shift_end)+30*60) %>%
  left_join(incident1, by = c("EmployeeID", "JoinDate")) %>%
  select(EmployeeID, shift_start, shift_end, DateOccurred) %>%
  mutate(has_incident = case_when(is.na(DateOccurred) ~ "No",
                                         !is.na(DateOccurred) ~ "Yes"))
## `summarise()` has grouped output by 'EmployeeID'. You can override using the `.groups` argument.
long_shifts_inc <- nswshifts_lj_inc %>%
  mutate(shift_date = as.Date(shift_start),
         shift_start_hour = hour(shift_start),
         shift_end_hour = hour(shift_end),
         Occurred_hour = hour(DateOccurred)) %>% 
  mutate(shift_end_hour = case_when(shift_end_hour < shift_start_hour ~ 24+shift_end_hour,
                                    TRUE ~ as.numeric(shift_end_hour)),
         trip_duration = difftime(shift_end, shift_start),
         trip_duration = trip_duration/dhours()) %>%
  filter(trip_duration < 24,
         shift_start_hour < 22, shift_start_hour > 4) %>%
  # creating list of shift hours between start and end hour & then unnesting the dataframe to make it long
  rowwise() %>%
  mutate(shift_hour_of_day = list(c(shift_start_hour:shift_end_hour))) %>%
  unnest(cols = c(shift_hour_of_day)) %>%
  mutate(shift_hour_for_driver = shift_hour_of_day - shift_start_hour) %>%
  # some further wrangling to remove unneccessary columns & convert incident hour into hourly indicator & restrict inc
  mutate(incident_indicator = case_when(
    Occurred_hour == shift_hour_of_day ~ 1,
    TRUE                               ~ 0
  )) %>%
  select(-shift_start, -shift_end, -DateOccurred, -Occurred_hour) 

sydneyweather_dh <- sydneyweather %>%
  mutate(date = as.Date(timestamp),
         hour = hour(timestamp),
         hour = case_when(hour == 0 ~ 24,
                          TRUE ~ as.numeric(hour)))

nsw_fit_df <- long_shifts_inc %>%
  left_join(sydneyweather_dh, by = c("shift_date" = "date", "shift_hour_of_day" = "hour")) %>%
  left_join(driver_info, by = "EmployeeID") %>%
  ungroup() %>%
  select(-EmployeeID, -has_incident, -shift_date, -shift_start_hour, -shift_end_hour, -trip_duration, -timestamp, -Depot, -CompanyID, -EmployeeStartDate, -EmploymentTerminationDate, -Position, -ContractType, -FTE, -BirthDate, -company, -date_hour) 
set.seed(2021)

nsw_fit_df <- nsw_fit_df %>%
  na.omit() %>%
  mutate(incident_indicator = factor(incident_indicator)) 
nsw_fit_split <- initial_split(nsw_fit_df, 2/3,
                                         strata = incident_indicator)
nsw_fit_tr <- training(nsw_fit_split) %>% 
  mutate(Gender = factor(Gender)) %>% 
  mutate(Precipitation = as.numeric(Precipitation)) %>%
  mutate(Temperature = as.numeric(Temperature))
nsw_fit_ts <- testing(nsw_fit_split)

up_train <- upSample(x = nsw_fit_tr[, -ncol(nsw_fit_tr)],
                     y = nsw_fit_tr$incident_indicator)                         
table(up_train$Class) 
## 
##     0     1 
## 39994 39994
nsw_fit_rf <- randomForest(incident_indicator ~ shift_hour_of_day+shift_hour_for_driver+Temperature+Precipitation, data = up_train, importance = TRUE)
print(nsw_fit_rf)
## 
## Call:
##  randomForest(formula = incident_indicator ~ shift_hour_of_day +      shift_hour_for_driver + Temperature + Precipitation, data = up_train,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 19.35%
## Confusion matrix:
##       0     1 class.error
## 0 27377 12617  0.31547232
## 1  2864 37130  0.07161074
nsw_fit_ts_pred <- nsw_fit_ts %>%
  mutate(.pred = predict(nsw_fit_rf, nsw_fit_ts, type = "prob"))

#caret package, fix classes to be balanced, rerun random forest, check classification matrix
# train random forest on different error functions false positive rate or recall/precision



table(up_train$incident_indicator)
## 
##     0     1 
## 39994 39994
importance(nsw_fit_rf)
##                               0        1 MeanDecreaseAccuracy MeanDecreaseGini
## shift_hour_of_day     -17.73132 197.7358             240.1236         2609.732
## shift_hour_for_driver -15.10100 180.5437             223.1298         3136.903
## Temperature           -11.60708 152.7067             150.2549        10848.697
## Precipitation          16.30465 288.8519             287.9000         1045.711
varImpPlot(nsw_fit_rf)

plot_predict_interaction(nsw_fit_rf, up_train, "shift_hour_of_day", "shift_hour_for_driver")

plot_predict_interaction(nsw_fit_rf, up_train, "Temperature", "Precipitation")

Mean Decrease Accuracy: How much the model accuracy decreases if we drop that variable. Mean Decrease Gini: Measure of variable importance based on Gini impurity index used for the calculation of splits in trees.